Introduction



The thesis of this paper is that achievement tests have changed their primary
function from serving as indicators of educational accomplishments. They have, in
addition, become instruments of educational policy and have come to be regarded as
effective means to alter educational achievement and productivity. I will explore this
assertion by using examples of research and development from state and national
testing activities. I will also consider how these alternative functions affect system
behavior, legitimate policy inferences, technical requirements of tests, and ultimately
our understanding of educational quality.

Educational testing has long been with us, but recently has demanded new levels
of attention as states and the Federal government have increased their investment in
and attention to the problem of measuring educational achievement. The function of
tests used to be straightforward: to find out what some person knew or could do. Tests
of this sort were given in schools to all of us. These tests were most often idiosyncratic
in their design, made up, as they were, by one or more teachers. Such tests might have
appeared to be very formal or even frightening to students, but their creation was
informal; that is, the content they included and the standards used for their scoring
were the decisions of teachers, people with close understanding of classroom
instruction. Even bureaucratically entrenched and successful tests, such as the New
York State Regents Examinations, were reasonably flexible, in that they were developed
by teams of teachers and test writers and changed annually to reflect transitions and
modifications in the curriculum.

The intellectual roots of the standardized test enterprise have been well 
documented (Coleman, Cronbach & Suppes, 1969). Driven by pressing national needs 
to make personnel decisions, during years of war, the mechanics of test design, 
administration, and analysis became more refined, more esoteric, and in turn, more 
credible to a technologically-oriented society. 

For example, test-based selection for admission to higher education, using the
Scholastic Aptitude Test (SAT), has been a regular part of students' experience
for about the last forty years. Yet the SAT became regarded as an end rather than as a
tool in many conversations about education. When the scores on the SAT declined (see
Harnischfeger & Wiley, 1975, and Wirtz et al., 1977, for analyses), inferences were
drawn about school effectiveness, even though that test was never designed to measure
the goals of educational programs. The concurrent rise in the minimum competency
test movement, where students needed to pass examinations on certain basic skills for
graduation from high school or even for grade to grade promotion, further moved
testing into the mainstream of American policy options (Jaeger & Tittle, 1980). In
minimum competency programs, in the 1970s in particular, the existence of the test
triggered a variety of policies that dramatically changed the rules (Lazarus, 1981).
Appropriate test performance became a goal, by almost any means necessary, and at
first at almost any cost as well, including long-term effects on children (Cohen & Haney,
1980; Kennedy, 1980). The most recent phase of testing involves a conceptual
extension of the idea of minimum competency to content and skills purportedly
demonstrating higher levels of subject matter competence (see the positions of Hirsch,
1987, and Finn & Ravitch, 1987), from minimal to optimal; but more about this topic
later.



The Policy Attractiveness of Testing 

Even though the testing of students and teachers remains controversial and
occasionally the subject of litigation and judicial review, many policy makers in school
districts, statehouses, and at the Federal level continue to see higher standards as the
cornerstone of educational reform efforts and tests as their operational implementation.
Why? Even as the phrase, "There are no quick fixes," grows more popular in our
rhetoric, we still continue our search for the chimera. Testing is assumed to be a
relatively expedient remedy and magnetically continues to attract policy proponents,
advocates who are in turn supported by a well-connected commercial testing industry.

What is it that testing seems to offer? I believe that testing suggests a wealth of
metaphors, the clearest of which is based on the image of good management. Testing
provides a "We mean business" orientation and functions as a lever on the efficiency
and effectiveness of educational organizations. It provides a mechanism that promises
to demonstrate how schools can be focused and be made more efficient. (See, for
example, Kirst, 1981, p. 61.) In the simplest terms, testing sends the message that
schools (and the expenditures that support them) can be managed. Thus, testing offers
a convenient communication vehicle and one that is backed up by sanctions. The
content of tests says "This is important! Pay attention!" Societal ascriptions of test
importance are functionally derived from how tests are communicated to and
interpreted by policy makers and the public. The stability of this perceived importance
may turn out to be independent of the actual effectiveness of tests in improving
educational quality.

Tests cost money, but their costs are relatively small compared to options such as
adding teachers or investing dramatically in staff development to update teachers'
content knowledge and pedagogical skills. Tests may not, in fact, be a quick fix; but
they may be a cheaper option than grass-roots restructuring and reform. And they are
tangible and palpable. Educational reform often deals in ideas, in words, and concepts
whose distinctions are not well understood by the public. Recall, for instance, the
public furor that cropped up when it was "New Math" time. Tests almost magically avoid
such confusion. Everyone knows what a test is. Furthermore, tests may be one of the
few options that can be imposed top-down (from the statehouse, for instance) and that
appear to have any effect at all on our diverse educational settings. And as tests become
policy instruments, with public political investments behind them, it is not too
surprising that their findings take on more portent to those who require them.

Tests as Indicators 

When tests are seen as one component of a system of outcome measures or
indicators rather than the creators of effects, our need to attend to them differs
dramatically. In some ways, the differences are paradoxical, and depend upon the
conditions under which the test results are actually used.

For example, if a test is seen as a policy device ("Teach this because it is
important"), then that which does not appear on the test loses credibility and currency
in the school environment. If tests do not include science, then science (or art, music,
history) may not be taught seriously. However, if tests are seen as one of many
indicators of data that bear on educational quality but do not define it, then our
teaching and curriculum do not hinge as precariously on what those tests indicate. The
arguments for test-driven education (Popham, 1974) are persuasive, but I believe they
ultimately can decrease the long range stability of an education system designed to
improve performance for all students. For when tests are intended not only to measure
the effects of particular reforms but are the reforms themselves, the interpretation of
positive and negative patterns of growth is extremely problematic. Interpretations of
increases or decreases in performance are confusing. (Although increases in
performance are not studied too diligently; they are usually accepted and attributed to
the most recent set of reforms.) Did the test succeed or fail to communicate new
standards? Were the programs and instruction that were put in place inferior? Are
there any data conditions under which the policy of testing is itself questioned?
Another difficulty results from the logical interest in looking at tests over time, to infer
trends of various sorts for policy action. Such trend analyses place content constraints
and technical requirements on measurement that limit the real match between what are 
or could be important educational goals and what we, some years earlier, made a 
commitment to measure. Unless these concerns are explicitly accounted for, they can 
only ultimately impede our understanding of educational quality. 

I believe that we are now in a phase where the transformation between tests as
indicators and tests as policy instruments is under way. This transformation is caused in
part by the staunch and apparently impervious belief in the validity or "hardness" of
measurement data. It also is pushed by the insidious proposition submerged in the
notion of testing as policy: to wit, if it isn't tested, it isn't important. When a system
explicitly attempts to measure all the important areas of schooling (a task at which it
can never succeed), the requirement for inclusiveness damages the entire educational
enterprise and unbalances schooling. Part of the damage is caused because tests as
policy instruments are almost always indirect. Teachers are tested because someone
neither trusts the quality of their selection and the preparation they receive at colleges
and professional schools of education nor knows how to influence them. Children are
tested because we aren't sure teachers know how to teach them. Tests run downhill
from the issue that we really wish to influence and often onto the people who are the
recipients rather than the instigators of the suspect policies.

Maps of State Testing 

In this section, I propose to shift from a general position statement on test
functions to a relatively detailed description of the topography of state testing. I will
report on state testing activities at a particular point in time, and attempt to provide a
snapshot of some of the intense activities in testing at the state level. The purpose of
this description is to show what is being tested where and what investments are being
made, and to set the stage for a section which follows that demonstrates how our
educational system responds to such mandates.

The impetus for state level testing was multiple, but can undoubtedly be
attributed to the changes in funding for education derived from the Serrano decision,
which was related to the state's responsibility to "equalize" educational expenditures
and preempted responsibility on matters of accountability from local agencies. The
retreat from programmatic Federal action in education that began with the Reagan
Administration lodged additional power and initiative at the state level. Testing
programs, already in place in California, Florida, and New York, became the focus of
much new state activity.

How widespread was this activity? Let's start with a time period following the
release of A Nation At Risk (National Commission on Excellence in Education, 1983), the
US Department of Education report that undoubtedly stimulated much state level
reform. At the end of 1984, 39 states were operating at least one statewide testing
program. Thirty-five states were conducting "assessment programs," programs that were
to monitor the overall effects of educational services in the state in terms of student
achievement. Thirty-six states were operating minimum competency testing (MCT)
programs. Twenty-two states had both assessment and MCT programs. These data were
developed as part of a major study on the feasibility of using existing state achievement
tests to provide a picture of the national achievement of U.S. students (Burstein, Baker,
Aschbacher, & Keesling, 1985). This study demonstrated clearly that a major investment
had been made in testing. The rest of this section will draw upon this study as its
primary source of information.

Who Gets Tested? 

What were state testing programs like? Who were the students tested? Testing
in states focused on eighth grade, with a total of 32 programs testing at this level. Other
frequently tested grades were grades 3, 4, 6, 10, and 11. The least frequently tested
grades were grades 1, 2, 7, and 12. At the time of the report, 24 of the states were
testing all students at the target grade level(s) (census testing), and as one would
expect, all competency programs tested every eligible student. Most states with testing
programs tested children in more than one grade. What other information was collected
about the students? The most frequently obtained data were about students' sex and
ethnicity, although about one-third of the reporting states did not require such
information. Language status and program participation, e.g., Chapter 1, were
information items collected by relatively few states. Peculiarly, student age and years
in school were of interest to only one or two states.

What's on the Tests?

What content was tested? In almost every case, the content areas tested were
reading and mathematics. Fewer than half the states giving tests conducted writing
assessments, using student essays as the data. But more than half tested in at least one
additional content area, such as social studies, science, or language arts. This research also
reports a detailed set of analyses which focused on test content. These analyses were
developed by carefully categorizing the items on actual copies of these tests, or in some
cases where test security was an issue, by inspecting the test specifications and sample
test items. The research team first developed a model to guide our analyses consisting of
a relatively flat hierarchy of major skills and subskills, shown in Figure 1. Based on this
kind of analysis, major skill categories were developed for reading, mathematics, and
writing; these are listed in Table 1.



Figure 1 

Relations among Content Areas, Major Skill Areas, and Subskills 
Table 1

Major Skill Areas Exhibited in State Testing Items

Reading                      Mathematics             Writing

Inferential Comprehension    Numbers & Numeration    Grammar
Literal Comprehension        Measurement             Word Usage
Vocabulary                   Variables               Organization
Word Attack                  Geometry                Writing Sample



The graphic display presented in Figure 2 was created by Leigh Burstein. A quick
review will give a good picture of the distribution of skills by content area and grade
level tested.

Figure 2 

Distribution of State-Tested Skills 
The team's analysis was more intensive.¹ Analysis of the skill areas was
decomposed an additional level into subskills, and examples of the type of items
measuring such subskills were provided. In Table 2, examples of the inferential
comprehension tasks in reading are presented.



Table 2

Decomposition into Subskills and Items for the
Inferential Comprehension Skill Area in Reading

1. DETAILS, SUPPORT STATEMENTS (Given passage)

   Which statement best supports James Lee's claim that the late bus would
   benefit students?

   a. The school board should find a way to resume the services of the late bus
   b. Extracurricular activities provide students with valuable learning experiences
   c. Some students can get rides from their parents
   d. Some working parents cannot take their children home from school

2. MAIN IDEA, SUMMARY, TITLE (Given passage, infer best title, summary
   statement, title)

   The main idea of these rules is that:

   a. both adults and children enjoy the swimming pool
   b. there is a snack bar at the swimming pool
   c. safety is extremely important at the swimming pool
   d. the swimming pool is open every day

3. MISSING/IRRELEVANT INFORMATION (Given passage, infer missing information
   or identify important information to include or exclude)

   Which of the following would be most important for the editors to include
   in this editorial?

   a. The school has never given the band any money for its uniforms
   b. Helmets and padding protect football players from injury
   c. Members of the marching band perform indoor concerts too
   d. The football team has longer practices than the marching band

4. MISSING WORDS (Given reading passage with several words omitted, identify
   best word to fit in blank from context.)

   (Note: New York's entire reading test was like this.)

5. SEQUENCE (Given a passage, infers order of events or logic.)

   What indicates that Minnie was the first in her neighborhood to have a
   sewing machine?

   a. The neighbor women all came to see it
   b. She had to make everyone's clothes
   c. Fred bought it
   d. She didn't know how to operate it at first

6. CAUSE/EFFECT (Given passage, infer cause or effect)

   A major reason Paramount Studio moved to California was to:

   a. allow the Army to use the Astoria plant
   b. avoid the destruction of the studio by vandals
   c. enable the Astoria plant to become a museum
   d. be able to make movies less expensively

7. CONCLUSIONS (Given passage, chart, etc., draw conclusions)

   Based on the information in this chart, it may be concluded that:

   a. cross-ventilation helps to warm a room
   b. gas heat is more expensive than electric heat
   c. fans use very little electricity
   d. insulating walls conserve energy all year round

8. PREDICTIONS (Given a passage, predict probable outcome)

   What probably happened next in this story?

   a. The girl became angry and went home
   b. Marina and the girl told each other their names
   c. The girl made fun of Marina
   d. Marina became embarrassed and stopped talking

9. FACT/OPINION (Given passage or statement, distinguishes fact from opinion)

   Which of the following is an example of an opinion?

   a. "In 1860, a midwestern stagecoach company let people know about an
      exciting new plan."
   b. "The mail must go through."
   c. "The route cut directly across from Missouri to Sacramento."
   d. "Each rider rode nonstop for about 100 miles."

10. PURPOSE, ATTITUDE (Given passage, infer author's purpose or attitude)

    The author's attitude toward the Pony Express riders can best be
    described as one of

    a. confusion
    b. amusement
    c. worship
    d. admiration

11. CHARACTER (Given passage, identify character traits, identify
    motivations, draw conclusions about character's feelings)

    The beasts and birds can best be described as

    a. proud and closed-minded
    b. understanding and wise
    c. sleepy and lazy
    d. thrifty, hard-working

12. FIGURATIVE LANGUAGE (Given passage, identify meaning of metaphor, simile,
    idiom, or other image or figure of speech used)

    The author's choice of words "sets up business" and "cleaning station"
    are used to show that

    a. the wrasse's means of getting food is almost like a business service
    b. wrasse fishing is big business
    c. all fish set up stations
    d. the wrasse enjoys cleaning itself in the water

13. TONE (Given passage, recognize mood)

    At the beginning of the story, the mood is one of

    a. disappointment and sorrow
    b. curiosity and excitement
    c. fear and suspense
    d. thankfulness and joy

14. COMPARE, CONTRAST (Given passage, infer similarities, differences)

    Compared to American managers, Japanese baseball managers are

    a. better advisors
    b. better paid
    c. more knowledgeable
    d. more powerful

15. ORGANIZATION (Given passage, select portion to complete outline or
    organizer based on organization of passage)

    The following outline is based upon the last paragraph of the passage.
    Which topic below is needed to complete it?

       A. Federalists
       B. Republicans

    a. Competing parties
    b. Jefferson's rivals
    c. Election pay-offs
    d. Strong governments

16. (Given passage, identify and interpret time, place of story or event)

    You can tell that this story took place

    a. in a city park
    b. at a zoo
    c. in a forest
    d. near a boot factory

17. LIT TYPE (Given passage, recognize example of fiction, nonfiction,
    biography, autobiography, similes, metaphors, etc.)

    This reading selection appears to be an example of

    a. an autobiographical account
    b. historical fiction
    c. a biographical sketch
    d. ancient mythology

¹ This work was primarily conducted by Pamela Aschbacher.

Using this analytical framework, the distribution of state efforts was categorized
in terms of the range of topics covered, the "spread" of items across subskills, the depth
of coverage within subskills (that is, how many items), and the distribution of items on
subskills classified as higher order skills, or skills with cognitive demands of inference,
application, or problem solving as opposed to mere information retrieval by students.
Eleven states were found to have relatively broad subskill coverage. In depth of
coverage, only California had many items for each subskill area, a phenomenon directly
related to California's matrix sampling approach. (Many items for many skill areas on a
census test would create a time and fatigue burden for students.) In other states, depth
was a function of topic and grade level tested. About one third of the states included
higher order skills in their testing programs. This analysis was used to help identify
the commonality of tested skills and was, in fact, conducted as part of a feasibility study
to examine aggregation of state tests to serve as a national indicator of school
achievement.

Recently the Office of Technology Assessment (1987) provided an update on
the grade, general content area, and ancillary data collected in state assessments.
Because the data collection was within six months of the UCLA study, not surprisingly,
no major changes emerged. The Office of Technology Assessment (OTA) report did
include some [illegible] testing policy history and plans [illegible]
states. What is striking is that in almost every case, the move is to more testing: more
grade levels, for more subject matters, for wider numbers of students. California cites
plans for consolidation of local and state measures to meet this state goal (Bennett &
Carlson, 1986). In addition, the institution of the Golden State Examinations is to
provide individual incentives for students to achieve higher standards, standards which
will be demonstrated on appropriate tests; successful students will receive
recognition on their diploma. California also uses a cash incentive program for schools
for increases on their twelfth grade California Assessment Program (CAP) scores. The
policy investment in this approach is high. Bennett and Carlson say:

Standardized tests are expected to focus the attention of educators and
policy makers at all levels on the knowledge, skills, concepts, and
processes which are essential for success in the more demanding high-
tech job market of the future, for responsible citizenship, and for
personal fulfillment. The core of content and skills to be spotlighted
represents a rigorous curriculum in the humanities, natural sciences, and
math and emphasizes higher-order skills such as those required to
analyze complex relationships, draw inferences, and reason deductively.
Although it is assumed that in practice, the scope and pace of the
curriculum will reflect differences in aptitude and intelligence . . . it is
also assumed that the majority of students are not working up to their
potential and that it is the responsibility of the schools to challenge
them to do so, both for their own good and for the good of the society.
(p. 169)

The Colorado state testing summary shows what happens when a state without a
history of large scale assessment moves into it (Martin, 1987).

Colorado has maintained an approach that is still at a very general level
compared to recent efforts in California, Illinois, or in the Southeastern states.
Nonetheless, during their deliberation, the tendency to use such tests as an omnibus
solution to all [illegible] arose. We specifically had conversations with policy makers on
the costs and benefits of using such measures as indicators of teaching performance. We
have been observing other state efforts that attempt to generalize their
student testing measures to teacher assessment, an approach that is clearly a bad idea on
conceptual and technical grounds.

Systemic Responses to Tests as the Conveyors of Standards

In this section, I wish to recount briefly the findings of some of my colleagues at
the Center for Research on Evaluation, Standards, and Student Testing (CRESST). One
of our research programs is studying the impact of testing on educational quality.
Our question was simply whether having such standards and tests (as created by recent
state and local reform efforts) helps or hurts the educational system. One question was
whether such tests as those required for high school graduation do in fact have positive
impact. Catterall (1987) has reported on a study of those ten states
with four or more years of a high school graduation test requirement. Interviews with
educators were obtained that described their analysis of the function of such tests.
He went on to select two states with the highest and two with the lowest
graduation rates. A survey was administered to 736 students sampled from within three
representative districts in each state. On the basis of his survey, Catterall predicts that
dropping out is significantly related to failure on such competency tests, and finds an
[illegible] for Hispanic students. Expectedly, socioeconomic class is also a strong
predictor, as well as "track" in school. Furthermore, such relationships were invisible, in
his terms, to the school personnel interviewed. If his findings are confirmed, then the
role of tests as the conveyors of standards may need some review. Clearly we can
improve overall performance by driving out poorly performing students. But such a
function is diametric to our intentions.

The second study, on the Texas Examination of Current Administrators and
Teachers (TECAT), was conducted by Lorrie Shepard, a CRESST researcher from the
University of Colorado. Shepard, Kreitzer, and Graue (1987) did an extensive set of
[illegible] that looked at the TECAT from its policy inception to the results and remedies
that resulted from its administration. The TECAT was designed to identify the teachers
and administrators who were not qualified to serve in educational roles. Shepard and her
team cast the [illegible] in the context of the need to revitalize the state
economy and the goal to be a center of high tech involvement. Shepard traces the
role of the newly elected governor and the importance of H. Ross Perot, a successful
technologist and head of the state task force on schooling. The timing of the Nation at
Risk report was also [illegible]. Shepard points out the TECAT was a literacy test measuring
[illegible] and writing skills, with "harsh consequences" (Shepard et al., p. 87), since
failing the test twice resulted in loss of job. The process of preparing for this test was
reported by Shepard and seemed to focus on succeeding at the particular test items
[illegible] and not reducing the kinds of grammatical errors ("he don't") that partly
[illegible] public support in the first place. The TECAT had a passing rate of 99%.
Because standards were set so low, teachers could still make a few flagrant errors and
pass. Moreover, the people the test "identified" may have been real losses to the
system, such as those who worked with the institutionally mentally retarded, shop
teachers who hadn't been certified through the usual means, and minority [illegible].
The test actually failed "1,199 teachers with some of the worst grammar skills. It may
also have forced out another 1,000 to 2,000 teachers who considered themselves at risk
on the test." Shepard et al. conclude that the TECAT harmed public opinion
about education and involved a set of:

unforeseen consequences: enormous cost, frenetic preparation and
[illegible] about the test, demoralized teachers, and a public disillusioned
by the high pass rate. Although these outcomes were not intended, they
may be inevitable features of a reform that hangs so much importance on
a test pitched to the lowest level of performance on the lowest of
teaching skills. (p. 91)

The work of Shepard et al. suggests that intentions are insufficient to assure
positive use and interpretation of test results. Rudner (1987) reports on the status of
teacher testing in [illegible]; analyses suggest that the "impact of such tests on
minorities has been severe" (p. 5). In a paper by Algina & Legg (1987), the validity and
[illegible] decision [illegible]. Given the demonstrable negative effects and
potential validity problems, teacher tests as approaches to reform should be more
carefully scrutinized.

Finally, Ellwein and Glass (1986), in a study conducted for CRESST, presented a
series of case studies of standard setting using tests. In a shorter version of this study
(Glass & Ellwein, 1986), the authors briefly summarize the six case studies, which
investigated four states and two local districts. By analyzing the intentions of such
[illegible] in contrast to their operations and effects, the authors generate a devastating
set of summary observations. They are:

1. When standards on tests are raised, safety nets are strung up (in the
   form of exemptions, repeated trials, softening cut-scores, tutoring for
   retests and the like) to catch those who fail. If 100 incompetent
   persons enter the arena, 99 will ultimately survive. The one who
   doesn't was probably no less able than many, but lost heart and quit.

2. Both the courts and professional educators honor the principle that
   students should be warned of impending standards and remediated
   when they fail.

3. Even the most orthodox and doctrinaire justification of cut-scores in
   terms of skills and competence is moderated in the end by
   consideration of pass-fail rates. Norm referencing drives out criterion
   referencing. Pure criterion referencing exists only in textbooks and
   scholarly journals; it is not found in the world of practice.

4. People focus on first-test failure rates and are less interested in
   ultimate failure rates.

5. In raising educational standards, the more technical looking approach
   has more political muscle. The language of arbitrary authority is
   despised. The language of technical rationality is widely honored.

6. Cut-score determination methods require the added authority of
   political symbols for their credibility (titles, political composition of
   groups of judges, technical authority such as ETS); these symbols are
   invoked to lend authority to what is actually a quite arbitrary
   procedure.

7. Managers of the educational system will act to soften the hard edges
   of technology and reclaim political discretion that has been
   appropriated by zealous technologists (in this case, technologists who
   would turn over the responsibility for determining who graduates
   from high school or is licensed to teach to a test and a statistical
   standard).

8. Universities are raising standards, in part as an attempt to get out of
   the business of remedial [illegible]. [illegible]
   comes in second to economics; competence loses out to enrollments.

9. In the end, standards are determined by consideration of politically
   and economically acceptable pass rates, symbolic messages and
   appearances, and scarcely at all by a behavioral analysis of necessary
   skills and competencies. The latter are relied on to the exclusion of
   the former to the extent that passing or failing the test has no lasting
   consequences in the lives of either students or teachers. (p. 4)

Glass and Ellwein distinguish between instrumental and symbolic acts and place 
testing and standard reform squarely in the symbolic category. As testing has shifted 
from [...] instruction to a "policy" imposed from outside, it appears to 
have in and of itself assumed power. The authors conclude with a set of questions 
about standards and testing: 
What purposes and political interests are served by raised standards? 
Whatever they are, we suspect they have little to do with the 
accomplishments and chances for "life success" of the pupils in whose 
name the reforms are undertaken. 

What effect is the movement having on schools, teachers and the 
way pupils learn? Schools may be winning renewed public confidence. 
Teachers are bearing the brunt of both the blame for the crisis that 
brought about the reforms and the busy work that the reforms have 
engendered. Pupils take what is dished out and move on. (pp. 5-6) 



The National Testing Scene 

The questions above and the analyses by Catterall, Shepard et al., and Glass & 
Ellwein have growing salience in the light of a series of interesting policy deliberations 
related to the Federal role in testing. The National Assessment of Educational Progress 
(NAEP) was created as an indicator system (Tyler, 1965). However, the attempt by the 
Department of Education to create a national picture of educational performance with 
its infamous "wall chart" has begun to transform NAEP from an indicator to a reform 
instrument. The "wall chart," in summary, used college entrance examinations like the 
SAT to rank states on outcomes from best to worst without regard for socioeconomics, 
mobility, or student ethnicity. Chief State School Officers attempted to argue for more 
[...] measures (instead of against the entire enterprise). In fact, there was a short-term 
benefit to low ranks because it permitted arguing for greater resources from state 
legislatures, for reforms, and so forth. Subsequently, NAEP came under review by a 
broadly composed group of scholars and practitioners (Alexander & James, 1986). Part 
of their deliberations involved the redesign of NAEP to extend its sampling, reporting, 
and interpretation to the fifty states. Thus, rankings or other measures of relative state 
performance would be possible. A series of discussions, planning activities, and now 
proposed legislation are moving this process along. While such a system would 
assumedly be an improvement over the SAT score base for state comparisons, the use 
of NAEP for such a function raises serious issues of the sort raised earlier in this paper 
and by my colleagues cited above. 

State by state reporting would change NAEP from an indicator to a policy 
instrument since it would undoubtedly drive states to attempt to increase their relative 
standing. However, the existence of such a salient national measure has other 
implications. First, the content and skills tested would have to have widespread 
agreement among the states. Clearly, to avoid making inappropriate inferences, 
states would want NAEP to conform as closely as possible to historical testing programs. 
If accurate, such a desire would drop content to the lowest common denominator. 
Second, the desire for trend data would tend to have a constrictive effect on the 
addition of new approaches to testing or to the addition or substitution of content areas. 
Third, the creation of such a test would result in a de facto national test and curriculum. 
While some claim that a national curriculum is already in place, created by the text 
publishing industry, the federalizing of standards through testing raises an alternative 
set of questions. 

Further, a consequence of the adoption of NAEP with supplemental state funding 
as a proxy state educational outcome measure would be to reduce existing state 
assessment programs since they will compete for some of the same funds. Thus, 
[...] conditions, a number of the concerns articulated by 
Shepard and Glass & Ellwein come into play. 

Another set of concerns involves the national policy scene. A large investment 
in NAEP may reduce the actual support for other indicators of national achievement, 
such as the studies of comparative U.S. performance conducted through the 
International Education Association (IEA) and supported by both government and 
private foundations. If NAEP becomes the megaoutcome measure of performance, then 
it will be used to assess a range of educational policy effects. Clearly, the danger of using 
a single measure is that it can produce anomalous results, as has been demonstrated in 
the recent difficulty with the 1986 NAEP reading scores (Rothman, 1988). 

Without doubt, the functions of tests will continue to evolve. We can hope that 
continued, parallel research analysis of their actual functions becomes a regular part of 
the implementation or strong modification of any of the major testing programs. Tests 
as metaphors, signs, and symbols are important, but no less than their actual effects on 
educational quality and the people who participate within our educational system. 